AUTOMATIC WORD ALIGNMENT

STATEMENT  AS  TO  FEDERALLY  SPONSORED 
RESEARCH 

This  invention  was  made  with  government  support  under 
NBCHC080097  awarded  by  DARPA.  The  government  has 
certain  rights  in  the  invention. 

BACKGROUND

This invention relates to automatic word alignment, for
example, for use in training of a statistical machine
translation (SMT) system.

SMT systems generally rely on translation rules obtained
from parallel training corpora. In phrase-based SMT systems,
the translation rule set includes rules that associate
corresponding source language phrases and target language
phrases, which may be referred to as associated phrase pairs.
When a manually annotated corpus of associated phrase pairs
is unavailable or inadequate, a first step in training the system
includes identification and extraction of the translation phrase
pairs, which involves the induction of links between the source
and target words, a procedure known as word alignment. The
quality of such word alignment can play a crucial role in the
performance of an SMT system, particularly when the SMT
system uses phrase-based rules.

SMT systems rely on automatic word alignment systems to
induce links between source and target words in a
sentence-aligned training corpus. One such technique, IBM Model 4,
uses unsupervised Expectation Maximization (EM) to
estimate the parameters of a generative model according to which
a sequence of target language words is produced from a
sequence of source language words by a parametric random
procedure.

EM is an iterative parameter estimation process and is
prone to errors. Less than optimal parameter estimates may
result in less than optimal alignments of the source and target
language sentences. The quality of the outcome depends
largely on the number of parallel sentences available in the
training corpus (a larger corpus is preferable), and their purity
(i.e., mutual translation quality). Thus, word alignment
quality tends to be poor for resource-poor language pairs (e.g.,
English-Pashto or English-Dari). In some cases a large
proportion of words can be incorrectly aligned or simply left
unaligned. This can lead to inference of incorrect translation
rules and have an adverse effect on SMT performance. Thus,
improving alignment quality can have a significant impact on
SMT accuracy.

Other  work  has  sought  to  improve  word  alignment  quality. 

For example, a number of “boosting” algorithms have been
proposed. In some traditional boosting algorithms (e.g.,
AdaBoost) for binary classification tasks, an iterative weight
update formula emphasizes incorrectly classified training
samples and attenuates those that are correctly classified, in
effect “moving” the class boundaries to accommodate the
misclassified points. Classifiers trained at each boosting
iteration (also known as weak learners) are combined to identify
class labels for test samples. In many cases, this combination
of weak learners results in better classification performance
than using a standard train/test approach.
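
As a rough illustration of the weight update described above, the following minimal sketch performs one round of the standard AdaBoost reweighting for binary labels in {-1, +1}. The function name and the toy data are assumptions made for this example only.

```python
import math

def adaboost_reweight(weights, labels, predictions):
    """One AdaBoost round: return (alpha, new_weights).

    weights sum to 1; labels and predictions take values in {-1, +1}.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must be strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples (y * h == -1) are scaled up, correct ones down.
    scaled = [w * math.exp(-alpha * y * h)
              for w, y, h in zip(weights, labels, predictions)]
    z = sum(scaled)
    return alpha, [w / z for w in scaled]

weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]   # the second sample is misclassified
alpha, new_weights = adaboost_reweight(weights, labels, predictions)
```

After the update, the single misclassified sample carries half of the total weight, which is what pushes the next weak learner toward it.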

However, placing such emphasis on poorly aligned
sentence pairs can distort word alignments and reduce alignment
quality over the entire corpus because poorly aligned
sentence pairs tend to be lower quality or non-literal translations
of each other.
Additionally, word alignment is significantly more
complex than simple binary classification. Moreover, a direct
measure of alignment quality (which can be used to update
weights for boosting), such as alignment error rate (AER),
can only be obtained from a hand-aligned reference corpus.
Another issue is determining the best way to combine
alignments from the weak learning iterations.

In one example, Wu et al. (“Boosting statistical word
alignment using labeled and unlabeled data,” Proc. COLING/ACL,
Morristown, N.J., USA, pp. 913-920) proposed a strategy for
boosting statistical word alignment based on a small
hand-aligned (labeled) reference corpus and a pseudo-reference set
constructed from unlabeled data. Theirs was a
straightforward extension of the AdaBoost algorithm using AER as a
measure of goodness. They used a weighted majority voting
scheme to pick the best target word to be linked to each source
word based on statistics gathered from the boosting iterations.
On a small scale, Wu’s strategy is practical; however, larger
hand-aligned reference corpora are extremely expensive to
construct and very difficult to obtain for resource-poor
language pairs.

In another example, Ananthakrishnan et al. (“Alignment
entropy as an automated measure of bitext fidelity for
statistical machine translation,” ICON '09: Proc. 7th Int. Conf. on
Natural Lang. Proc., December 2009) proposed a technique
for automatically gauging alignment quality using bootstrap
resampling. The resamples were word aligned and a measure
of alignment variability, termed alignment entropy, was
computed for each sentence pair. The measure was found to
correlate well with AER. Subsequently, they proposed a
coarse-grained measure of phrase pair reliability, termed phrase
alignment confidence, based on the consistency of valid
phrase pairs across resamples.

There is a need for an automatic word alignment system
that improves upon traditional alignment techniques for the
purpose of creating corpora, for instance, that are more
representative of hand-aligned corpora.

SUMMARY 

In one general aspect, the invention relates to an
unsupervised boosting strategy for refining automatic word
alignment. One of the goals is to improve the quality of automatic
word alignment, for example for resource-poor language
pairs, thus improving SMT performance.

In another aspect, in general, a method is applied to
aligning linguistic units in paired sequences of units of a stored
corpus that includes a plurality of paired sequences of units
formed from two languages. The method includes
determining a plurality of weights, one for each pair of the plurality of
paired sequences of units, and maintaining the weights in a
computer storage. A computer implemented procedure is
applied to iteratively update weights. At each iteration, and
for each pair of the paired sequences of units, an alignment is
formed by aligning units in one sequence of the pair with units
in the other sequence of the pair using a parametric alignment
procedure using a set of alignment parameters. A quality
score is determined for the alignment for each of the paired
sequences of units. The set of alignment parameters is
updated using the alignment procedure and dependent on the
plurality of weights for the paired sequences. The plurality of
weights maintained in the computer storage is updated using
the determined quality scores of the alignments. Finally,
formed alignments from a plurality of the iterations are
combined to determine a combined alignment of units of the
paired sequences.

Aspects may include one or more of the following features.

The  linguistic  units  comprise  words. 

The method further includes using the combined
alignments as input to an automated training procedure for a
Statistical Machine Translation (SMT) system. For instance, the
trained SMT system is used to translate a sequence of units
from a first of the two languages to the other of the two
languages.

The alignment procedure comprises an iterative
statistically based procedure. For instance, the iterative statistically
based procedure comprises an Expectation Maximization
procedure.

Updating  the  alignment  parameters  using  the  alignment 
procedure  and  dependent  on  the  plurality  of  weights  for  the 
paired  sequences  includes  weighting  a  contribution  of  each 
paired  sequence  according  to  the  maintained  weight  for  said 
paired  sequence. 

Forming the alignment for each of the paired units includes
forming a first alignment of units of the first language to units
of the second language, and forming a second alignment of
units of the second language to units of the first language.

The alignment parameters include a first set of parameters
for forming an alignment from the first language to the second
language and a second set of parameters for forming an
alignment from the second language to the first language.

Forming  the  alignment  for  each  of  the  paired  units  includes 
combining  the  first  alignment  and  the  second  alignment. 

Combining  the  first  alignment  and  the  second  alignment 
includes  linking  units  that  are  linked  in  each  of  the  first  and 
the  second  alignments. 

Determining the quality score for the alignment for each of
the paired sequences of units includes determining a
normalized probability of producing units in one sequence of the pair
from units of the other sequence of the pair.

Determining the normalized probability includes
determining a geometric per-unit average of a product of a
probability of producing a first sequence of units of the pair from
the second sequence of units of the pair, and the probability of
producing the second sequence of the pair from the first sequence
of the pair.

Combining  the  formed  alignments  from  the  plurality  of  the 
iterations  to  determine  the  combined  alignment  of  units  of  the 
paired  sequences  includes  forming  for  each  of  the  paired 
sequences  a  union  of  the  alignments  from  the  plurality  of 
iterations. 

The steps are performed without requiring manual
annotation of alignments of units in the corpus of paired sequences.

In another aspect, in general, a training system for machine
translation includes a storage for a plurality of weights, one
weight corresponding to each of a plurality of paired
sequences of linguistic units formed from two languages in a
stored corpus. The system also includes a module that
includes storage for a set of alignment parameters and that is
configured to iteratively update the plurality of weights. At
each iteration, for each of the paired sequences of units, an
alignment is formed by the module by aligning units in one
sequence of the pair with units in the other sequence of the pair
using a parametric alignment procedure using the set of
alignment parameters. The module is configured to determine a
quality score for the alignment for each of the paired
sequences of units, and then update the alignment parameters
using the alignment procedure and dependent on the plurality
of weights for the paired sequences, and update the plurality
of weights maintained in the computer storage using the
determined quality scores of the alignments. The module is
further configured to combine the formed alignments from a
plurality of the iterations to determine a combined alignment
of units of the paired sequences.

In another aspect, in general, software comprises
instructions embodied on a tangible machine readable medium for
causing a data processing system to determine a plurality of
weights, one for each of a plurality of paired sequences of
linguistic units formed from two languages in a stored corpus,
and maintain the weights in a computer storage. The system is
further caused to iteratively update the plurality of weights,
including at each iteration, for each of the paired sequences of
units, form an alignment by aligning units in one sequence of
the pair with units in the other sequence of the pair using a
parametric alignment procedure using a set of alignment
parameters, determine a quality score for the alignment for
each of the paired sequences of units, update the alignment
parameters using the alignment procedure and dependent on
the plurality of weights for the paired sequences, and update
the plurality of weights maintained in the computer storage
using the determined quality scores of the alignments. The
software further causes the data processing system to
combine the formed alignments from a plurality of the iterations
to determine a combined alignment of units of the paired
sequences.

Embodiments may have one or more of the following
advantages.

The unsupervised boosting strategy can automatically
estimate the alignment quality of a parallel corpus based on
statistics obtained from the alignment process and emphasize
sentence pairs that are potentially well aligned. Sentence
pairs that are potentially poorly aligned are attenuated. When
carried out in an iterative fashion, well aligned sentences are
“boosted” such that they have a greater impact on the
alignment statistics. Thus, the contribution of unreliable,
potentially low quality translation pairs in the training corpus is
minimized.

This approach can result in fewer unaligned words, a
significant reduction in the number of extracted translation
phrase pairs, a corresponding improvement in SMT decoding
speed, and a consistent improvement in translation
performance across multiple language pairs and test sets. The
reduction in storage and processing requirements coupled
with improved accuracy make the proposed technique ideally
suited for interactive translation services, facilitating
applications such as mobile speech-to-speech translation.

No hand-aligned reference corpus is necessary for the
system. This eliminates the significant time and expense
typically incurred in obtaining such a resource. Instead, an
unsupervised measure of alignment quality is used.

The word alignment system aggregates word alignments
from all boosting iterations using a “union” operation rather
than voting and picking the best target word to be linked to a
given source word. Thus, translation accuracy across language
pairs and test sets is improved, while the total number of
extracted translation rules (e.g., phrase pairs) is reduced. This
results in faster performance and lower memory
consumption.

The algorithm functions at the word alignment level, and is
independent of most SMT architectures. The boosted word
alignment can be used to train different types of SMT
systems, such as phrase-based (used in this work), hierarchical,
and syntax-based systems.

The algorithm is a heuristic method for creating a
many-to-many linkage between parallel sentence pairs.

The use of a bidirectional alignment mitigates the impact
of errors that may occur in one translation direction.

Other  features  and  advantages  of  the  invention  are  apparent 
from  the  following  description,  and  from  the  claims. 

DESCRIPTION OF DRAWINGS

FIG.  1  is  a  block  diagram  of  one  embodiment  of  an  iterative 
boosting  system  for  automatic  word  alignment. 

FIG. 2 is a pseudo code representation of an iterative
boosting system for automatic word alignment.

FIG.  3  shows  two  example  alignments  of  two  parallel 
sentences.  The  top  example  is  a  baseline  alignment  and  the 
bottom  example  is  a  boosted  alignment. 

FIG. 4 is a table of the baseline and boosted system
percentage BLEU scores for E2P and P2E test sets.

FIG.  5  is  a  table  comparing  phrase  table  size  and  decoding 
speed. 

DESCRIPTION 

1  Overview 

Referring to FIG. 1, one embodiment of a word alignment
system 100 is configured to implement an iterative boosting
word alignment algorithm. (Note that the word “boosting”
should be understood only within the context of this
description and not to connote properties where it is used in other
contexts.) The system iteratively refines automatic word
alignment of a parallel corpus with the goal of improving
performance of an SMT system trained using the resulting
word alignments. FIG. 2 is a pseudo code representation of
the procedure implemented by the word alignment system
100. FIGS. 1 and 2 are referred to in the overview below, with
more detailed description following in subsequent sections of
the Description.

Referring to FIG. 1, the word alignment system 100 makes
use of a set (S,T) of N paired sentences $(s_j, t_j)$ (FIG. 2, line 001)
and maintains a weight $w_j$ associated with each pair, updating
the weights from iteration to iteration. Generally, a weight $w_j$
represents a quality of the pairing and alignment of the $(s_j, t_j)$
sentence pair. The weights at the $i$-th iteration are referred to as
$w_i = \{w_{i,j}\}$, the initial weights $w_0$ all being set to 1.0
(FIG. 2, line 002).

The system 100 includes two alignment modules 108, 120,
each configured to accept a sentence paired parallel corpus
106, 118 and corresponding alignment model parameters
110, 122. Generally, the alignment module 108 treats
sentences in the S set as being from the “source” language and
sentences from the T set as from the “target” language. The
model parameters $\theta_{S \to T}$ 110 characterize a statistical model
that a sentence $s_j$ in the source language “generates” a
sentence $t_j$ in the target language. The alignment module 120
reverses the roles of S and T as “target” and “source”,
respectively, and makes use of a set of model parameters $\theta_{T \to S}$ 122.

As introduced above, the parallel corpora 106, 118 are
weighted by a set of weights 104 before they are passed to the
alignment modules 108, 120. The alignment modules 108,
120 use the weighted corpora and the alignment parameters
110, 122 to form updated word alignments 112, 124. An
alignment $b_j$ represents an alignment of words in sentence $s_j$
with words in sentence $t_j$ using the $\theta_{S \to T}$ parameters, and the
set of alignments determined at the $i$-th iteration is represented
as $B_i(S,T)$. Similarly, an alignment $c_j$ represents an alignment
of words in sentence $t_j$ with words in sentence $s_j$ using the $\theta_{T \to S}$
parameters, and the set of alignments determined at the $i$-th
iteration is represented as $C_i(T,S)$. $B_i(S,T)$ and $C_i(T,S)$ are
later combined by an alignment combination module 116 to
form a bidirectional alignment 140 at the $i$-th iteration,
represented as $A_i(S,T)$ (FIG. 2, line 004).

The alignment modules also compute at each iteration
updated parameters in the process of forming the new
alignments. For example, the new parameters $\theta_{S \to T}$ 110
characterize the statistical model that generates a sentence $t_j$ in the
target language from a sentence $s_j$ in the source language. The
procedures carried out by alignment module 120 are
generally the same, with the roles of the source and target languages
reversed.

After all the paired sentences have been aligned in an
iteration, the quality of each of the word alignments 112, 124
is assessed by an alignment quality assessment module 129
and these alignment qualities are used to update the set of
weights 104. In this example, the quality of an alignment is
determined according to the probability of the generated
word sequence. For example, the quality of an alignment $b_j$ is
computed as $P_\theta(t_j, s_j)$.

The ‘boosting’ process by which the weights are updated is
repeated M times in an iterative loop process 102, with the
index i maintaining the number of iterations completed by the
loop process (FIG. 2, line 003). The bidirectional alignment
140 is accumulated by an accumulation module 128 at each
iteration of the loop 102. When the loop 102 completes M
iterations, a final alignment is formed by merging the
accumulated bidirectional alignments using a union module 132
(FIG. 2, line 010). The final alignment is then provided to
downstream systems for further SMT training 134.
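
The control flow of the overview can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of FIG. 2 itself: `align_fn` is a hypothetical callback standing in for the two alignment modules, the combination module, and the quality assessment module, and `toy_align` is a stub that returns fixed links and scores so that the loop can run on its own.

```python
import math

def boost_alignments(corpus, align_fn, num_iterations):
    """Iteratively reweight sentence pairs and return the union of links.

    align_fn(corpus, weights) is a hypothetical callback returning
    (alignments, scores): one set of (src_idx, tgt_idx) links and one
    quality score strictly between 0 and 1 per sentence pair.
    """
    n = len(corpus)
    weights = [1.0] * n                       # unit initial weights
    accumulated = [set() for _ in range(n)]
    for _ in range(num_iterations):           # M boosting iterations
        alignments, scores = align_fn(corpus, weights)
        for acc, links in zip(accumulated, alignments):
            acc |= links                      # running union over iterations
        # Weighted average quality, scale factor, and weight update.
        delta = sum(w * q for w, q in zip(weights, scores)) / n
        alpha = 0.5 * math.log((1.0 - delta) / delta)
        scaled = [w * math.exp(alpha * q) for w, q in zip(weights, scores)]
        z = sum(scaled) / n                   # keeps the weights summing to N
        weights = [w / z for w in scaled]
    return accumulated, weights

def toy_align(corpus, weights):
    # Stand-in aligner: fixed links and scores, ignoring the weights.
    return [{(0, 0)}, {(0, 1)}], [0.4, 0.1]

corpus = [(["a"], ["x"]), (["b"], ["y", "z"])]
links, final_weights = boost_alignments(corpus, toy_align, num_iterations=3)
```

In this toy run the first pair has the higher quality score, so its weight grows from iteration to iteration while the weights continue to sum to N.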

2  Parallel  Corpora 

The first parallel corpus (S,T) 106 is a body of text S written
in a first language that is associated with a body of text T in a
second language on a sentence by sentence basis. The second
parallel corpus (T,S) 118 is substantially the same as the first
parallel corpus 106 with the exception that the roles of S and
T are reversed to facilitate a bidirectional alignment. The
parallel corpora 106, 118 each include N sentence pairs
(FIG. 2, line 001). Note that the system does not require word
or phrase level alignments in the corpora, and the system is
tolerant of a range of quality of the pairing of the sentences.

2.1 Weights

Prior to providing the parallel corpora 106, 118 to the
alignment modules 108, 120, the corpora 106, 118 are
weighted by the set of weights 104 (FIG. 2, line 002). The set
of weights 104 includes N scalar weights, each weight
corresponding to one of the sentence pairs in the parallel corpora
106, 118. The same set of weights 104 is applied to both
parallel corpora 106, 118. The first boosting iteration applies
equal (unit) weight to each sentence pair of the parallel
corpora 106, 118, and subsequent iterations use updated weights.

2.2 Alignment Modules

At the $i$-th iteration of the loop 102, a set of alignments $B_i$
112 is obtained by providing the weighted parallel corpus
(S,T) 106 to the alignment module 108 along with the set of
alignment parameters $\theta_{S \to T}$ 110 (FIG. 2, line 004). The
alignment module 108 is configured to analyze each of the
sentence pairs $(s_j, t_j)$ included in the weighted parallel corpus 106
and determine words in a target sentence $t_j$ that correspond to
words in a source sentence $s_j$. The association of a word in
the source sentence with corresponding words in the target
sentence is called a link. The alignment module 108 also
determines an alignment probability $p(t_j, s_j)$, which is the joint
probability of the target sentence $t_j$ and the source sentence $s_j$
using the most likely alignment $b_j$, given the alignment
parameters $\theta_{S \to T}$ 110 of the alignment model.

The word alignment system 100 is configured to generate
at the $i$-th iteration a set of alignments $C_i$ 124, which includes
the links determined from sentences in language T to
sentences in language S (i.e., a backward alignment). These
backward alignments are determined by an alignment module
120, which performs the same procedures as the other
alignment module 108, but uses a separate set of parameters $\theta_{T \to S}$
122, and uses the second weighted corpus 118 as input.

Together the first (“forward”) alignment 112 and the
second (“backward”) alignment 124 are referred to as a
bidirectional alignment. In some examples, the links of the forward
and backward alignment are combined in a heuristic fashion
by an alignment combination module 116, such that links of
the combined alignment are the intersection of the links of the
forward and backward alignments.
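
The intersection heuristic can be sketched as a set operation. In this minimal example, which is an illustration rather than the combination module itself, a link is represented as an index pair; the representation and the function name are assumptions.

```python
def intersect_alignments(forward, backward):
    """Keep only links present in both directions.

    forward:  set of (src_idx, tgt_idx) links from the S-to-T alignment.
    backward: set of (tgt_idx, src_idx) links from the T-to-S alignment.
    """
    # Flip the backward links so both sets use (src_idx, tgt_idx) order.
    flipped = {(s, t) for (t, s) in backward}
    return forward & flipped

forward = {(0, 0), (1, 2), (2, 1)}
backward = {(0, 0), (2, 1), (3, 2)}   # (tgt_idx, src_idx) pairs
combined = intersect_alignments(forward, backward)
```

Here only the links (0, 0) and (1, 2) survive, because they are the only ones proposed by both directions.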

The alignment modules 108, 120 compute updated
parameters 110, 122 during the computation of the alignments 112,
124. Each of the pairs of training sentences $(s_j, t_j)$ and $(t_j, s_j)$
contributes to the updated parameters based on the weight $w_j$
of the pair, such that pairs with low weight contribute less to
the updated parameters than pairs with higher weight. Note
that in the first iteration, because all pairs have the same unit
weight, all pairs contribute equally.
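
One way to picture weighted contributions is to scale each sentence pair's expected counts by its weight during EM. The sketch below does this for IBM Model 1, which is a deliberate simplification chosen for brevity (it is not Model 4, and it omits the NULL word); the function name and toy corpus are assumptions.

```python
from collections import defaultdict

def weighted_model1_step(corpus, weights, t_prob):
    """One EM step of IBM Model 1 with per-sentence-pair weights.

    corpus: list of (source_words, target_words) pairs.
    t_prob: dict mapping (target_word, source_word) to p(target | source).
    """
    counts = defaultdict(float)
    totals = defaultdict(float)
    for (src, tgt), w in zip(corpus, weights):
        for t in tgt:
            # E-step: posterior over which source word generated t.
            norm = sum(t_prob[(t, s)] for s in src)
            for s in src:
                frac = w * t_prob[(t, s)] / norm   # weight scales the count
                counts[(t, s)] += frac
                totals[s] += frac
    # M-step: renormalize the weighted expected counts.
    return {(t, s): c / totals[s] for (t, s), c in counts.items()}

corpus = [(["das", "haus"], ["the", "house"]), (["das"], ["the"])]
uniform = {(t, s): 0.5 for src, tgt in corpus for s in src for t in tgt}

equal = weighted_model1_step(corpus, [1.0, 1.0], uniform)
down = weighted_model1_step(corpus, [1.0, 0.1], uniform)
```

With equal weights, p(the | das) moves to 0.75 after one step; when the second pair is down-weighted to 0.1, the same estimate is only about 0.545, since that pair contributes less evidence.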

In  some  embodiments,  the  alignment  modules  implement 
the  IBM  Model  4  algorithm.  (FIG.  2,  line  004) 

2.3  Accumulation  Modules 

The bidirectional word alignments $A_i$ 140 produced by the
alignment combination module 116 are accumulated over the
iterations by an accumulation module 128. The complete set
of bidirectional word alignments 140 generated within the
iterative loop process 102 is used by later modules of the
system 100.

2.4  Alignment  Quality  Assessment 

The sets of N alignment probabilities 114, 126 associated
with each of the alignments 112, 124 are passed to an
alignment quality assessment module 129. The alignment
quality assessment module 129 is configured to calculate a
measure of the bidirectional alignment quality from the
alignments 112, 124. Thus, for each sentence pair of each
alignment 112, 124, an unsupervised measure of word alignment
quality for boosting is calculated (FIG. 2, line 005).

In the present embodiment, for each sentence pair, the
forward alignment probability $p(t_j \mid s_j)$ and backward
alignment probability $p(s_j \mid t_j)$ are combined and sentence-length
normalized to determine a score, which provides a good
correlate of alignment quality. In some examples, this combined
and normalized score is computed as a geometric mean:

$$PL_i(s_j, t_j) = \exp\left(\frac{\ln p(s_j \mid t_j) + \ln p(t_j \mid s_j)}{|s_j| + |t_j|}\right)$$

where $|s_j|$ and $|t_j|$ are the lengths of the sentences (in words).

In embodiments that make use of the IBM Model 4
alignment process, each source word is linked to exactly one target
word (which may be the empty word NULL); therefore, the
number of allowable links in the forward and backward
alignments is simply the total number of source and target words in
the sentence pair $(s_j, t_j)$. Therefore, each of the scores $PL_i(s_j, t_j)$
is in the range 0.0 to 1.0.
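
The length-normalized score is easiest to compute in log space. The following minimal sketch assumes the forward and backward log-probabilities are already available from the aligner; the function name and the numeric inputs are hypothetical.

```python
import math

def alignment_quality(logp_fwd, logp_bwd, src_len, tgt_len):
    """Per-word geometric mean of the forward and backward probabilities.

    logp_fwd and logp_bwd are natural-log alignment probabilities;
    src_len and tgt_len are sentence lengths in words.
    """
    return math.exp((logp_fwd + logp_bwd) / (src_len + tgt_len))

# Hypothetical probabilities for a 3-word source and a 2-word target.
score = alignment_quality(math.log(1e-3), math.log(1e-2), 3, 2)
```

The product of the two probabilities is 1e-5 and there are five words in total, so the score is the fifth root of 1e-5, that is, 0.1.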

2.5  Update  Set  of  Weights 

An  updated  set  of  weights  130  is  generated  by  using  the 
result  of  the  alignment  quality  assessment  module  129  to 
modify  the  set  of weights  104.  The  updated  set  of  weights  130 
is  used  to  weight  the  parallel  corpora  106,  118  in  the  next 
iteration  of  the  loop  102. 

Specifically, FIG. 2, lines 006-008 present a detailed set of
equations for updating the set of weights 104. The weighted
average quality score over the entire parallel corpus is
computed as:

$$\delta_i = \frac{1}{N} \sum_{j=1}^{N} w_{i-1,j} \, PL_{i,j}$$

where $PL_{i,j}$ is the quality score $PL_i(s_j, t_j)$ computed in the $i$-th
iteration using the weights $w_{i-1,j}$ determined in the previous
iteration. Using the IBM Model 4 procedure, $\delta_i$ is in the range
0.0 to 1.0. A scale factor $\alpha_i$ is computed from $\delta_i$ as
$\alpha_i = 0.5 \ln((1 - \delta_i)/\delta_i)$. The new weights are then determined by scaling
each prior weight $w_{i-1,j}$ by $\exp(\alpha_i PL_{i,j})$ and then
multiplicatively normalizing by a divisor Z so that the sum of the new
weights is again N.
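
A small worked example may make the update concrete. The sketch below follows the description above under the assumption that the weighted average is taken by dividing by N (the weights sum to N); the function name, variable names, and scores are hypothetical.

```python
import math

def update_weights(prev_weights, scores):
    """Return (new_weights, delta, alpha) for one boosting iteration.

    prev_weights sum to N; scores are per-pair quality scores in (0, 1).
    """
    n = len(prev_weights)
    # Weighted average quality score over the corpus.
    delta = sum(w * q for w, q in zip(prev_weights, scores)) / n
    alpha = 0.5 * math.log((1.0 - delta) / delta)
    # Scale each prior weight, then normalize so the sum is again N.
    scaled = [w * math.exp(alpha * q) for w, q in zip(prev_weights, scores)]
    z = sum(scaled) / n
    return [w / z for w in scaled], delta, alpha

weights, delta, alpha = update_weights([1.0, 1.0, 1.0], [0.3, 0.2, 0.1])
```

For these inputs the average score is 0.2, so the scale factor is 0.5 ln 4 = ln 2, and each weight is multiplied by 2 raised to its score before normalization. Note that the scale factor is positive only while the average score stays below 0.5; above that, the same formula would attenuate rather than boost high-scoring pairs.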

2.6  Union  of  Alignments 

When the iterative loop 102 completes M iterations, the
bidirectional word alignments 140 which were accumulated
by the accumulation module 128 are provided to a union
module 132 (FIG. 2, line 010). The union module 132
analyzes all of the accumulated alignments and creates a final
alignment by aggregating word alignments from all boosting
iterations using a “union” operation. Therefore, two words
are linked if there is both a forward link and a backward link
between the words at any iteration of the process.
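
For a single sentence pair, the union operation amounts to merging the per-iteration link sets. The following minimal sketch assumes each iteration's bidirectional alignment is a set of (source index, target index) pairs; the names and data are illustrative only.

```python
def union_over_iterations(per_iteration_links):
    """Merge the bidirectional link sets of one sentence pair."""
    final = set()
    for links in per_iteration_links:
        final |= links
    return final

iteration_links = [
    {(0, 0), (1, 1)},           # iteration 1 (baseline)
    {(0, 0), (2, 2)},           # iteration 2
    {(0, 0), (1, 1), (3, 2)},   # iteration 3
]
final_links = union_over_iterations(iteration_links)
```

Source words 2 and 3 are unaligned in the first iteration but gain links in later ones, so the merged alignment leaves fewer words unaligned than any single iteration.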

Each iteration of the iterative loop 102 can result in a
distinct word alignment that may be different from all others
(i.e., includes a distinct set of links) due to the changes in the
set of weights 104 from one iteration to the next. The
differences between the bidirectional word alignments are
reconciled for translation phrase pair extraction. The differences
can be reconciled by calculating, for each sentence pair, the
union of source-target word alignment links across all
boosting iterations. The union module 132 combines the weak
learners by taking, for each sentence pair, the union of the
accumulated word alignments obtained from the forward and
backward alignments at each iteration. The resulting final
alignment includes far fewer unaligned source and target
words than any of the individual alignments and is more
robust to errors (e.g., a link missing from the baseline
alignment could be present in one or more of the boosted versions).

The final alignment is passed on to later SMT training
algorithms 134 that can be configured to extract translation
rules such as phrase pairs from merged bidirectional
(source-to-target and target-to-source) alignments.

Referring to FIG. 3, a baseline alignment of a sentence pair
302 is compared to a final bidirectional alignment of the same
sentence pair 304 for an English-to-Pashto translation task.
The Pashto sentence is represented in Buckwalter notation, an
ASCII-based encoding for languages using the Arabic script.
Alignments such as these 302, 304 are used by a phrase pair
extraction algorithm to create translation phrase tables.

For example, the heuristic phrase pair extraction algorithm
described by Koehn et al. (“Statistical phrase-based
translation,” in NAACL ’03: Proc. 2003 Conf. of the N. American
Chapter of the Assoc. for Comp. Linguistics on Human
Language Technology) is used to build a translation phrase table
from the bidirectional baseline and union of boosted
alignments. The phrase table encodes translation phrase pairs and
their associated statistics, which are used by the SMT system
(decoder) in conjunction with other parameters, as described
below.
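
To make the role of the alignment in phrase pair extraction concrete, the following is a simplified sketch of alignment-consistent phrase pair extraction. It is not the full heuristic of Koehn et al.: it omits the extension of phrases over unaligned boundary words and any scoring, and the function name `extract_phrase_pairs`, the `max_len` limit, and the toy sentence pair are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]  # assumed layout: (source word index, target word index)


def extract_phrase_pairs(
    src: List[str], tgt: List[str], links: Set[Link], max_len: int = 4
) -> List[Tuple[str, str]]:
    """Return phrase pairs whose words are linked only to each other."""
    pairs = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            t_pos = [t for (s, t) in links if s_start <= s <= s_end]
            if not t_pos:
                continue
            t_start, t_end = min(t_pos), max(t_pos)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no target word in the span may be
            # linked to a source word outside the source span.
            inconsistent = any(
                t_start <= t <= t_end and not (s_start <= s <= s_end)
                for (s, t) in links
            )
            if inconsistent:
                continue
            pairs.append(
                (" ".join(src[s_start:s_end + 1]), " ".join(tgt[t_start:t_end + 1]))
            )
    return pairs


# Toy example: the second and third words swap order in the target.
toy_pairs = extract_phrase_pairs(
    ["a", "b", "c"], ["x", "y", "z"], {(0, 0), (1, 2), (2, 1)}
)
```

Here the span "a b" is rejected because its target span would include "y", which is linked to "c" outside the source span.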

3  Phrase-Based  SMT  System  Results 

In the present embodiment, the final word alignment is
provided to a phrase-based SMT system. The system uses a
log-linear model of various features (translation
probabilities, language model probabilities, distortion penalty, etc.) to
estimate the posterior probability of various target hypotheses
given a source sentence. The hypothesis with the highest
posterior probability is chosen as the translation output as is
illustrated by the following equation.
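
The equation is not reproduced in this text. A standard form of the decision rule for a log-linear model, given here as an assumed reconstruction, is shown below. The symbols are assumptions: f is the source sentence, e a target hypothesis, h_k the feature functions (translation probabilities, language model probabilities, distortion penalty, etc.), and λ_k the feature weights.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \sum_{k=1}^{K} \lambda_k \, h_k(e, f)
```

The second equality holds because the normalization term of the log-linear posterior depends only on f and therefore does not affect the maximization over e.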

The proposed word alignment boosting strategy was
evaluated in the context of English-to-Pashto (E2P) and
Pashto-to-English (P2E) translation, a low-resource language pair. For E2P, the
training and tuning sets consisted of 220 k and 2.4 k sentence pairs,
respectively. For P2E, the corresponding corpus sizes were
236 k and 2.1 k sentence pairs. Two unseen test sets were used
for both directions. The E2P test sets included T1 E2P, a test
set of 1.1 k sentences with one reference translation each, and
T2 E2P, a test set of 564 sentences with four reference
translations per sentence. The P2E test sets included T1 P2E,
consisting of 1.1 k sentences with one reference translation
each, and T2 P2E, containing 547 sentences with four
reference translations each. The multi-reference test sets came
from the official DARPA TRANSTAC evaluations conducted
by NIST.

First, baseline SMT systems were trained for both
directions. The first step was to obtain forward and backward IBM
Model 4 word alignments for the parallel training set using
GIZA++. These were merged to produce bidirectional
alignments for phrase pair extraction as described in Koehn et al.
Target language models (LMs) were trained using all
available data for English and Pashto, including target sentences
from the corresponding parallel corpora. The LMs were fixed
across all translation experiments described in this section.
The tuning sets were used to optimize SMT decoder feature
weights for E2P and P2E using MERT to maximize BLEU.
Translation performance was then evaluated on all test sets in
both directions using BLEU as a measure of translation
accuracy.

Subsequently, phrase tables were trained from the
union of boosted alignments obtained as described above for
both directions. Twenty boosting iterations were performed.
Decoder feature weights were re-tuned (with the same LMs
and optimization starting points as the baseline) using MERT.
Finally, translation performance of the boosted SMT system
was compared to the baseline system across all test sets for
E2P and P2E. The BLEU scores are summarized in FIG. 4.

Referring to FIG. 4, with identical decoding parameters
and pruning settings, the proposed boosting strategy
outperformed the baseline system by 0.6% BLEU on both test sets
in the E2P direction; for P2E, it obtained a 0.3% improvement on the
single-reference test set and a 0.9% gain on the
multi-reference set. These improvements are consistent
across multiple test sets in both directions.

Compared to the baseline word alignment, the union of
boosted alignments had, as expected, a lower proportion of
unaligned source and target words across language pairs, as
shown in FIG. 5. As a result, the number of translation phrase
pairs extracted from the union of boosted alignments was
significantly lower than that obtained from the baseline
system. The total number of phrase pairs in the E2P and P2E
directions decreased by 52.6% and 50.8%, respectively. This
led to a corresponding reduction in their storage footprint, as
summarized in FIG. 5.

In order to gauge the improvement in translation speed as a
result of the smaller phrase tables, the additional experiment
of decoding the multi-reference test sets T2 E2P and T2 P2E
with our already highly efficient phrase-based decoder was
performed on the Google Nexus One smart phone.

The comparison of decoding speeds is also summarized in
FIG. 5. Using identical hypothesis pruning settings, decoding
speed increased from 52.6 words/second to 57.2
words/second (an increase of 8.7%) for E2P, and from 50.4
words/second to 54.9 words/second (an 8.9% improvement) for
P2E.
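
The reported percentages follow directly from the decoding speeds. The short check below recomputes them; the helper name `relative_gain` is a hypothetical name introduced for this illustration.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Decoding speeds in words/second, as reported above.
e2p_gain = relative_gain(52.6, 57.2)
p2e_gain = relative_gain(50.4, 54.9)
print(f"E2P: {e2p_gain:.1f}%  P2E: {p2e_gain:.1f}%")
```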

Thus, the proposed boosting technique achieves the
distinction of improving translation accuracy, while
simultaneously reducing storage requirements and decoding time
over an already highly speed-tuned baseline. However, no
significant reduction in search space or memory consumption
was observed when using the boosted phrase table. This
indicates that most of the speed gains come from faster search
graph construction, given that the number of translation options
for a given source phrase is reduced by a factor of two.
4 Implementations and Alternatives

Embodiments of the approaches described above may be
implemented in software, in hardware, or in a combination of
hardware and software. Software implementations can
include instructions stored on computer-readable media for
causing one or more data processing systems to perform the
functions described above. In some implementations, a single
data processing system may be used, while in other
implementations, multiple data processing systems (e.g.,
computers) may be used in a centralized and/or distributed
implementation.

Examples described above do not necessarily assume any
prior knowledge regarding the quality of the sentence pairs. In
other examples, prior knowledge, for example, based on
human review, may be used by assigning non-uniform
weights before the first iteration.

The specific computations described above for updating
the weights of sentence pairs are only examples. Other similar
approaches may be used without departing from the spirit of
the overall approach. For example, other computations can
achieve the result of increasing the weighting of relatively
reliable sentence pairs while reducing the weight of
unreliable pairs.
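
As one purely illustrative possibility, and not the computation described above, a multiplicative update could raise the weights of sentence pairs whose quality score is above the weighted mean and lower the others. The function name `update_weights`, the `rate` parameter, and the exponential form are all assumptions made for this sketch.

```python
import math
from typing import List


def update_weights(
    weights: List[float], quality: List[float], rate: float = 1.0
) -> List[float]:
    """Hypothetical update: boost above-average pairs, then renormalize."""
    total_w = sum(weights)
    # Weighted mean quality under the current weights.
    mean_q = sum(w * q for w, q in zip(weights, quality)) / total_w
    scaled = [
        w * math.exp(rate * (q - mean_q)) for w, q in zip(weights, quality)
    ]
    total = sum(scaled)
    return [s / total for s in scaled]


# Uniform starting weights for three sentence pairs with differing quality.
start = [1.0 / 3.0] * 3
scores = [0.9, 0.5, 0.1]
updated = update_weights(start, scores)
```

Non-uniform starting weights, such as those derived from human review, can be passed in place of `start` without changing the function.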

Other approaches for combining the alignments from
different iterations can also be used rather than forming the
union. For example, only a limited number of iterations can
be combined, and consistency of alignment from iteration to
iteration may be taken into account.
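
A minimal sketch of such an alternative combination is given below, assuming each iteration contributes a set of (source index, target index) links for a sentence pair. The function name `combine_by_vote` and its parameters `min_votes` and `last_n` are hypothetical; with the default settings the function reduces to the plain union.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple

Link = Tuple[int, int]  # assumed layout: (source word index, target word index)


def combine_by_vote(
    alignments: List[Set[Link]],
    min_votes: int = 1,
    last_n: Optional[int] = None,
) -> Set[Link]:
    """Keep links found in at least `min_votes` of the chosen iterations.

    `last_n`, if given, must be a positive integer and restricts the
    combination to the most recent `last_n` iterations.
    """
    chosen = alignments if last_n is None else alignments[-last_n:]
    votes = Counter(link for links in chosen for link in links)
    return {link for link, count in votes.items() if count >= min_votes}


per_iteration = [
    {(0, 0), (1, 1)},
    {(0, 0), (2, 2)},
    {(0, 0), (1, 1)},
]
plain_union = combine_by_vote(per_iteration)
consistent = combine_by_vote(per_iteration, min_votes=2)
recent = combine_by_vote(per_iteration, min_votes=2, last_n=2)
```

Raising `min_votes` trades coverage for consistency: links that appear in only one iteration are dropped, at the cost of leaving more words unaligned.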

Other  alignment  procedures  can  also  be  used  in  place  of 
IBM  Model  4  (e.g.,  IBM  Model  1,  HMM  alignment,  etc.). 

It is to be understood that the foregoing description is
intended to illustrate and not to limit the scope of the
invention, which is defined by the scope of the appended claims.
Other embodiments are within the scope of the following
claims.

What  is  claimed  is: 

1.  A  method  for  aligning  linguistic  units  in  paired 
sequences  of  units  of  a  stored  corpus  comprising  a  plurality  of 
paired  sequences  of  units  formed  from  two  languages,  the 
method  comprising: 

determining  a  plurality  of  weights,  one  for  each  pair  of  the 
plurality  of  paired  sequences  of  units,  and  maintaining 
the  weights  in  a  computer  storage; 
applying a computer implemented procedure to iteratively
update the plurality of weights, including at each
iteration

for  each  pair  of  the  paired  sequences  of  units,  forming  an 
alignment  including  aligning  units  in  one  sequence  of 
the  pair  with  units  of  the  other  sequence  of  the  pair 
using  a  parametric  alignment  procedure  using  a  set  of 
alignment  parameters, 

determining  a  quality  score  for  the  alignment  for  each  of 
the  paired  sequences  of  units, 
updating the set of alignment parameters using the
alignment procedure and dependent on the plurality of
weights  for  the  paired  sequences,  wherein  the  set  of 
alignment  parameters  are  updated  such  that  paired 
sequences  of  units  with  weights  representing  a  higher 
quality  of  alignment  are  emphasized  as  compared  to 
paired  sequences  of  units  with  weights  representing  a 
lower  quality  of  alignment,  and 
updating the plurality of weights maintained in the
computer storage using the determined quality scores of
the  alignments;  and 

combining the formed alignments from a plurality of the
iterations to determine a combined alignment of units of
the paired sequences.
2. The method of claim 1 wherein the linguistic units
comprise words.

3.  The  method  of  claim  1  further  comprising: 

using the combined alignments as input to an automated
training procedure for a Statistical Machine Translation
(SMT) system.

4.  The  method  of  claim  3  further  comprising: 

using the trained SMT system to translate a sequence of
units from a first of the two languages to the other of the
two languages.

5.  The  method  of  claim  1  wherein  the  alignment  procedure 
comprises  an  iterative  statistically  based  procedure. 

6. The method of claim 5 wherein the iterative statistically
based procedure comprises an Expectation Maximization
procedure.

7. The method of claim 1 wherein updating the alignment
parameters using the alignment procedure and dependent on
the plurality of weights for the paired sequences includes
weighting a contribution of each paired sequence according
to the maintained weight for said paired sequence.

8. The method of claim 1 wherein forming the alignment
for each of the paired units includes forming a first alignment
of units of the first language to units of the second language,
and forming a second alignment of units of the second
language to units of the first language.

9. The method of claim 8 wherein the alignment
parameters include a first set of parameters for forming an alignment
from the first language to the second language and a second
set of parameters for forming an alignment from the second
language to the first language.

10. The method of claim 8 wherein forming the alignment
for each of the paired units includes combining the first
alignment and the second alignment.

11. The method of claim 9 wherein combining the first
alignment and the second alignment includes linking units
that are linked in each of the first and the second alignments.

12. The method of claim 1 wherein determining the quality
score for the alignment for each of the paired sequences of
units includes determining a normalized probability of
producing units in one sequence of the pair from units of the other
sequence of the pair.

13. The method of claim 12 wherein determining the
normalized probability includes determining a geometric
per-unit average of a product of a probability of producing a first
sequence of units of the pair from the second sequence of
units of the pair, and the probability of producing the second
sequence of the pair from first sequence of the pair.

14. The method of claim 1 wherein combining the formed
alignments from the plurality of the iterations to determine
the combined alignment of units of the paired sequences
includes forming for each of the paired sequences a union of
the alignments from the plurality of iterations.

15. The method of claim 1, wherein the steps are performed
without requiring manual annotation of alignments of any of
the units in the corpus of paired sequences.

16.  A  training  system  for  machine  translation  comprising: 

a storage for a plurality of weights, one weight
corresponding to each of a plurality of paired sequences of linguistic
units  formed  from  two  languages  in  a  stored  corpus;  and 
a module including a storage for a set of alignment
parameters and configured to iteratively update the plurality of
weights, including at each iteration
for each of the paired sequences of units, form an
alignment including aligning units in one sequence of the
pair with units the other sequence of the pair using a
parametric alignment procedure using the set of
alignment parameters,

determine  a  quality  score  for  the  alignment  for  each  of 
the  paired  sequences  of  units, 
update  the  alignment  parameters  using  the  alignment 
procedure  and  dependent  on  the  plurality  of  weights 
for  the  paired  sequences,  wherein  the  set  of  alignment 
parameters  are  updated  such  that  paired  sequences  of 
units  with  weights  representing  a  higher  quality  of 
alignment  are  emphasized  as  compared  to  paired 
sequences of units with weights representing a lower
quality  of  alignment,  and 

update the plurality of weights maintained in the
computer storage using the determined quality scores of
the  alignments;  and 

wherein  the  module  is  further  configured  to  combine  the 
formed  alignments  from  a  plurality  of  the  iterations  to 
determine  a  combined  alignment  of  units  of  the  paired 
sequences. 

17. Software comprising instructions embodied on a
non-transitory machine readable medium for causing a data
processing system to:

determine a plurality of weights, one for each of a plurality
of  paired  sequences  of  linguistic  units  formed  from  two 
language  in  a  stored  corpus,  and  maintain  the  weights  in 
a  computer  storage; 

iteratively  update  the  plurality  of  weights,  including  at 
each  iteration 

for each of the paired sequences of units, form an
alignment including aligning units in one sequence of the
pair with units the other sequence of the pair using a
parametric alignment procedure using a set of
alignment parameters,

determine  a  quality  score  for  the  alignment  for  each  of 
the  paired  sequences  of  units, 
update  the  alignment  parameters  using  the  alignment 
procedure  and  dependent  on  the  plurality  of  weights 
for  the  paired  sequences,  wherein  the  set  of  alignment 
parameters  are  updated  such  that  paired  sequences  of 
units  with  weights  representing  a  higher  quality  of 
alignment  are  emphasized  as  compared  to  paired 
sequences  of  units  with  weights  representing  a  lower 
quality  of  alignment,  and 

update the plurality of weights maintained in the
computer storage using the determined quality scores of
the  alignments;  and 

combine  the  formed  alignments  from  a  plurality  of  the 
iterations  to  determine  a  combined  alignment  of  units  of 
the  paired  sequences. 


